-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: continuous performance monitoring and PR comment #6283
Conversation
82c1e40
to
4146960
Compare
paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*'] | ||
pull_request: | ||
types: [opened, synchronize, reopened] | ||
paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about excluding examples/
subdirectories except for examples/server
? It could help reduce unneeded runs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do it in another PR if you dont' mind/
schedule: | ||
- cron: '04 2 * * *' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this scheduled run? If so, how will we view the results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, it will do the steps not related to PR: commit status and upload artefact. I will later process all commit checks statuses to show performance improvements day after day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds awesome 👍 It would also be cool if we pile up the daily performance results somewhere and visualize the performance improvement.
Also, if scheduled run's role becomes to differ from PR-based runs too much, consider making it a separate workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I want to do something like, probably stored on GH pages.
https://home.apache.org/~mikemccand/lucenebench/indexing.html
But it will require a little time and logic to reprocess previous commits, taking into account parameters have changed :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think we should put much effort in reprocessing previous commits. Better to focus just on the new versions from now on
@ggerganov Hi Georgi, I think it is OK for a first version, please review the changes |
Cannot create a token with |
@ggerganov I see, just create a classic token as I did on the fork will work. Also please add: |
Also, probably it's better if you try to start the github runner manager yourself on the Azure T4 node: git clone https://github.com/ggml-org/ci.git
cd ci
git remote add phymbert https://github.com/phymbert/ci.git
git fetch phymbert
git checkout hp/github-runner
./start-github-runner-manager.sh ggerganov/llama.cpp $TOKEN Standard_NC4as_T4_v3 |
A classic token with I tried to make a fine-grained token with the following config: But I get an error:
|
Yes I did not managed with fined grained token. Up to you, I do not see other option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only need minor changes
wget --quiet https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz | ||
tar xzf prometheus*.tar.gz --strip-components=1 | ||
./prometheus --config.file=examples/server/bench/prometheus.yml & | ||
while ! nc -z localhost 9090; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should add a timeout here, just in case something goes wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workflow will be killed after a while. If you don't mind it can be added later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's not very important, but I still prefer not to rely on CI timeout because it can be long (usually minutes or hours), we should add a timeout of 10 seconds here for example.
Co-authored-by: Xuan Son Nguyen <[email protected]>
@ggerganov Please note that the github default runner that we are currently using has a global token for the whole repo (GITHUB_TOKEN), so adding another self hosted runner with the same privilege does not hurt.
We can add some checks in the runner manager to see on which branch the workflow will run, which author, and minimum approval. Am I missing something ? Tell me if you want me to implement some security changes. I agree it might still be some work in progress @ngxson if you have a better idea, this is very welcomed EDIT: the token is only used in the manager to get a jitconfig, the runner will use GITHUB_TOKEN secret as all others runners. SO feel free to remove my ssh public key. |
The github token used by The But anyway this permission is still very powerful, because it allows change list of collaborators, description of the repo,... So it should still be kept secured. And yes anyone having access to the VM can see the token, but I don't think it's a problem. Gitlab runner works the same way (token need to be hard-written inside the VM). Also for now only you and @ggerganov have access to the VM so there's no problem. |
The required token is too powerful - it can be used to delete the repo at the very least. So I'm hesitant to give access to it It seems that the only option that we have atm is to limit access to the node just to myself and start the self-hosted manager. Will try to do this today or tomorrow |
Thanks Georgi, I am definitely all in for the least priviledge approach, especially when there is CI automation, human mistakes always happen. If you have time, could you simply delete the work VM and create a fresh one ? I am just imagining an other option:
Will it work ? less convenient but more secured. |
Yeah seems like a good idea. Just remind that you can use change the |
The problem is how the workflow will be triggered this way. I need to think a little bit more if it's possible to schedule the workflow on another repo without git sync |
My idea is: From llama.cpp, we can send a request to ggml-org to tell it to trigger the pipeline. Imagine that it's a bit like our "Publish Docker image" step that make a call to the registry outside of the runner. This approach requires llama.cpp to keep a ggml-org's token as secret, but the token only has |
@ggerganov please review the added comment and I think we are good |
It looks we have our baseline ^^but wondering why this VM is slower than before:
Is it the same CPU/RAM/T4 ? |
Yes, it's Btw, I'm not super confident about the PR comments - might get annoying to have those on every PR. For now lets put everything after the bullet points in |
Done, note there is only one comment per PR |
) * server: bench: init * server: bench: reduce list of GPU nodes * server: bench: fix graph, fix output artifact * ci: bench: add mermaid in case of image cannot be uploaded * ci: bench: more resilient, more metrics * ci: bench: trigger build * ci: bench: fix duration * ci: bench: fix typo * ci: bench: fix mermaid values, markdown generated * typo on the step name Co-authored-by: Xuan Son Nguyen <[email protected]> * ci: bench: trailing spaces * ci: bench: move images in a details section * ci: bench: reduce bullet point size --------- Co-authored-by: Xuan Son Nguyen <[email protected]>
) * server: bench: init * server: bench: reduce list of GPU nodes * server: bench: fix graph, fix output artifact * ci: bench: add mermaid in case of image cannot be uploaded * ci: bench: more resilient, more metrics * ci: bench: trigger build * ci: bench: fix duration * ci: bench: fix typo * ci: bench: fix mermaid values, markdown generated * typo on the step name Co-authored-by: Xuan Son Nguyen <[email protected]> * ci: bench: trailing spaces * ci: bench: move images in a details section * ci: bench: reduce bullet point size --------- Co-authored-by: Xuan Son Nguyen <[email protected]>
) * server: bench: init * server: bench: reduce list of GPU nodes * server: bench: fix graph, fix output artifact * ci: bench: add mermaid in case of image cannot be uploaded * ci: bench: more resilient, more metrics * ci: bench: trigger build * ci: bench: fix duration * ci: bench: fix typo * ci: bench: fix mermaid values, markdown generated * typo on the step name Co-authored-by: Xuan Son Nguyen <[email protected]> * ci: bench: trailing spaces * ci: bench: move images in a details section * ci: bench: reduce bullet point size --------- Co-authored-by: Xuan Son Nguyen <[email protected]>
Motivation
In the context of:
starts the existing k6 script benchmark on the azure node using:
Then add a PR comment, example:
Attach results, image and logs as job artefacts:
Set the commit status with a minimized JSON results for later reprocessing.
Tested in:
TODO:
IMGUR_CLIENT_ID
, see instructionsactions:read
andworkflows
(!!! The required token is too powerful - it can be used to delete the repo at the very least), a less risky approach to be implemented later on